Multilabel associative classification categorization of MEDLINE articles into MeSH keywords.

Authors

  • Rafal Rak
  • Lukasz A Kurgan
  • Marek Reformat
Abstract

The specific characteristic of classification of medical documents from the MEDLINE database is that each document is assigned to more than one category, which requires a system for multilabel classification. Another major challenge was to develop a scalable method capable of dealing with hundreds of thousands of documents. We proposed a novel system for automated classification of MEDLINE documents to MeSH keywords based on the recently developed data mining algorithm called ACRI, which was modified to accommodate multilabel classification. Five different classification configurations in conjunction with different methods of measuring classification quality were proposed and tested. The extensive experimental comparison showed the superiority of methods based on reoccurrence of words in an article over nonrecurrent-based associative classification. The relatively high macro F1 value achieved (46%) demonstrates the high quality of the proposed system for this challenging dataset. Accuracy of the proposed classifier, defined as the ratio of the sum of TP and TN examples to the total number of examples, reached 90%. Three scenarios were proposed based on the performed tests and different possible objectives. If the goal is to classify the largest number of documents, a configuration that maximizes micro F1 should be chosen. On the other hand, if a system is to work well for categories with a small number of documents, a configuration that maximizes macro F1 is more suitable. A tradeoff can be obtained by using a configuration that optimizes the average between macro and micro F1.
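The abstract's three quality measures (macro F1, micro F1, and accuracy as (TP + TN) / total) can be illustrated with a minimal Python sketch. The per-category counts and the category names `category_A` and `category_B` below are invented purely for illustration and are not taken from the paper; the function names are likewise hypothetical, and this is not the authors' evaluation code.

```python
from typing import Dict, Tuple

# Per-category confusion counts as (TP, FP, FN, TN).
Counts = Tuple[int, int, int, int]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); taken as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts: Dict[str, Counts]) -> float:
    """Average the per-category F1 scores, so every category weighs the same."""
    scores = [f1(tp, fp, fn) for tp, fp, fn, _tn in counts.values()]
    return sum(scores) / len(scores)


def micro_f1(counts: Dict[str, Counts]) -> float:
    """Pool the counts over all categories first, so large categories dominate."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


def accuracy(counts: Dict[str, Counts]) -> float:
    """(TP + TN) / total number of examples, pooled over categories."""
    correct = sum(c[0] + c[3] for c in counts.values())
    total = sum(sum(c) for c in counts.values())
    return correct / total


# Invented toy data: one frequent category and one rare category,
# each evaluated on 1000 documents.
toy_counts: Dict[str, Counts] = {
    "category_A": (90, 10, 10, 890),  # frequent, classified well
    "category_B": (1, 1, 3, 995),     # rare, classified poorly
}

print(f"macro F1 = {macro_f1(toy_counts):.3f}")
print(f"micro F1 = {micro_f1(toy_counts):.3f}")
print(f"accuracy = {accuracy(toy_counts):.3f}")
```

In this toy data the rare category pulls macro F1 well below micro F1, while accuracy stays close to 1 because true negatives dominate the counts. That is one way to read why a configuration tuned for macro F1 favors small categories and one tuned for micro F1 favors the bulk of the documents.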


Similar resources

A MEDLINE categorization algorithm

BACKGROUND Categorization is designed to enhance resource description by organizing content description so as to enable the reader to grasp quickly and easily what are the main topics discussed in it. The objective of this work is to propose a categorization algorithm to classify a set of scientific articles indexed with the MeSH thesaurus, and in particular those of the MEDLINE bibliographic d...


Multi-topic Text Categorization Based on Ranking Approach

This paper is devoted to the multi-topic (multilabel) text classification problem. We propose two methods for reduction from ranking to the multi-label case. Unlike existing multi-label classification methods based on reduction from ranking problem, where the complex classification (threshold) function is being defined on the input feature space, in our approach we propose the construction of s...


Distributed Representations for Automating MeSH Indexing

Manual MeSH indexing of the millions of journal articles cataloged in PubMed each year has become a daunting and expensive challenge for the National Library of Medicine. While the prospect of automated indexing is tempting, the requisite task of multilabel hierarchical classification is a difficult one. This article explores the possibility of generating distributed vector repre...


Using Discourse Analysis to Improve Text Categorization in MEDLINE

PROBLEM Automatic keyword assignment has been largely studied in medical informatics in the context of the MEDLINE database, both for helping search in MEDLINE and in order to provide an indicative "gist" of the content of an article. Automatic assignment of Medical Subject Headings (MeSH), which is formally an automatic text categorization task, has been proposed using different methods or com...


Query Translation by Text Categorization

We report on the development of a cross-language information retrieval system, which translates user queries by categorizing these queries into terms listed in a controlled vocabulary. Unlike usual automatic text categorization systems, which rely on data-intensive models induced from large training data, our automatic text categorization tool applies data-independent classifiers: a vector-space...



Journal title:
  • IEEE engineering in medicine and biology magazine : the quarterly magazine of the Engineering in Medicine & Biology Society

Volume 26, Issue 2

Pages: -

Publication date: 2007